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Abstract. We give families of examples where sharp rates of conver- 
gence to stationarity of the widely used Gibbs sampler are available. 
The examples involve standard exponential families and their conjugate 
priors. In each case, the transition operator is explicitly diagonalizable 
with classical orthogonal polynomials as eigenfunctions. 

Key words and phrases: Gibbs sampler, running time analyses, expo- 
nential families, conjugate priors, location families, orthogonal polyno- 
mials, singular value decomposition. 
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1. INTRODUCTION 

The Gibbs sampler, also known as Glauber dy- 
namics or the heat-bath algorithm, is a mainstay of 
scientific computing. It provides a way to draw sam- 
ples from a multivariate probability density f{xi,X2, 
...,Xp), perhaps only known up to a normalizing 
constant, by a sequence of one-dimensional sampling 
problems. From {Xi, . . . , Xp) proceed to {X[,X2, ■ ■ ■ , 
Xp), then {X[,X2,X3, . . .,Xp), . . . , {X'i,X'2, . . . ,X'p) 
where at the ith stage, the coordinate is sampled 
from / with the other coordinates fixed. This is one 
pass. Continuing gives a Markov chain X, X' ,X" , . . . , 
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which has / as stationary density under mild con- 
ditions discussed in [4, 102]. 

The algorithm was introduced in 1963 by Glauber 
[49] to do simulations for Ising models, and inde- 
pendently by Turcin [103]. It is still a standard tool 
of statistical physics, both for practical simulation 
(e.g., [88]) and as a natural dynamics (e.g., [10]). 
The basic Dobrushin uniqueness theorem showing 
existence of Gibbs measures was proved based on 
this dynamics (e.g., [54]). It was introduced as a 
base for image analysis by Geman and Geman [46]. 
Statisticians began to employ the method for rou- 
tine Bayesian computations following the works of 
Tanner and Wong [101] and Gelfand and Smith [45]. 
Textbook accounts, with many examples from bi- 
ology and the social sciences, along with extensive 
references are in [47, 48, 78]. 

In any practical application of the Gibbs sampler, 
it is important to know how long to run the Markov 
chain until it has forgotten the original starting state. 
Indeed, some Markov chains are run on enormous 
state spaces (think of shuffling cards or image anal- 
ysis) . Do they take an enormous number of steps to 
reach stationarity? One way of dealing with these 
problems is to throw away an initial segment of the 
run. This leaves the question of how much to throw 
away. We call these questions running time analyses 
below. 

Despite heroic efforts by the applied probability 
community, useful running time analyses for Gibbs 
sampler chains is still a major research effort. An 
overview of available tools and results is given at the 
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end of this introduction. The main purpose of the 
present paper is to give famihes of two component 
examples where a sharp analysis is available. These 
may be used to compare and benchmark more ro- 
bust techniques such as the Harris recurrence tech- 
niques in [64], or the spectral techniques in [2] and 
[107]. They may also serve as a base for the compar- 
ison techniques in [2, 26, 32]. 

Here is an example of our results. The following 
example was studied as a simple expository example 
in [16] and [78], page 132. Let 

TT^dO) = uniform on (0, 1), x G {0, 1, 2, . . . , n}. 

These define the bivariate Beta/Binomial density 
(uniform prior) 



fix, 6) 
with marginal density 

m{x)= f{x,e)de 
Jo 



^^(1-0)"- 



n + 1' 



xe{0,l,2,...,n}. 



The posterior density [with respect to the prior 
7r((i^)] is given by 



iT{e\x) 



fe{x) 



m[x) 

= (n+l)(^)^^■(l-^^-^ ^G(o,i). 

The Gibbs sampler for /(x, 9) proceeds as follows: 

• From X, draw 9' from Beta(a; + 1, n — x -|- 1). 

• From 9' , draw x' from Binomial(n, 0'). 

The output is {x',9'). Let K{x,9;x' ,9') be the tran- 
sition density for this chain. While K has f{x,9) as 
stationary density, the {K, f) pair is not reversible 
(see below). This blocks straightforward use of spec- 
tral methods. Liu et al. [77] observed that the "x- 
chain" with kernel 



k{x,x')= / fe{x')TT{9\x)d9 
Jo 



fe{x)fe{x') 
m{x) 



o 

0. 




50 100 150 

Number of steps 

Fig. 1. Simulation of the Beta/Binomial "x-chain" with 
n = 100. 

is reversible with stationary density ■m{x) [i.e., m(x) • 
k{x,x') = m{x')k{x' ,x)]. For the Beta/Binomial ex- 
ample 

^ + 1 Oix') 



k{x, x) 



< X, x' < n. 



d9 



A simulation of the Beta/Binomial "x-chain" (1.1) 
with n = 100 is given in Figure 1. The initial posi- 
tion is 100 and we track the position of the Markov 
chain for the first 200 steps. 

The proposition below gives an explicit diagonal- 
ization of the 3>chain and sharp bounds for the bi- 
variate chain [K^ g denotes the density of the distri- 
bution of the bivariate chain after i steps starting 
from {n,9)]. It shows that order n steps are neces- 
sary and sufficient for convergence. That is, for sam- 
pling from the Markov chain K to simulate from the 
probability distribution /, a small (integer) multi- 
ple of n steps suffices, while ^ steps do not. The 
following bounds make this quantitative. The proof 
is given in Section 4. For example, when n = 100, 
after 200 steps the total variation distance (see be- 
low) to stationarity is less than 0.0192, while after 
50 steps the total variation distance is greater than 
0.1858, so the chain is far from equilibrium. 

We simulate 3000 independent replicates of the 
Beta/Binomial "x-chain" (1.1) with n= 100, start- 
ing at the initial value 100. We provide histograms 
of the position of the "x-chain" after 50 steps and 
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Fig. 2. Histograms of the observations obtained from simulating 3000 independent replicates of the Beta/Binomial "x-chain" 
after 50 steps and 200 steps. 



200 steps in Figure 2. Under the stationary distribu- 
tion (which is uniform on {0, 1, . . . , 100}), one would 
expect roughly 150 observations in each block of the 
histogram. As expected, these histograms show that 
the empirical distribution of the position after 200 
steps is close to the stationary distribution, while 
the empirical distribution of the position after 50 
steps is quite different. 

If f,g are probability densities with respect to a 
a-finite measure /i, then the total variation distance 
between / and g is defined as 

(1.2) \\f-g\\^^ = l J \f{x)-g{x)\fi{dx). 

Proposition 1.1. For the Beta/Binomial ex- 
ample with uniform prior, we have: 

(a) The chain (1.1) has eigenvalues 
/3o = l, 

(n + 2)(n + 3)---(n+j + l)' 

In particular, f3i = l — 2/{n + 2). The eigenf unctions 
are the discrete Chebyshev polynomials [orthogonal 
polynomials for m{x) = l/(n + 1) on {0, ... , n}] . 

(b) For the bivariate chain K, for all 9,n and i, 

2/3f<lK.-/llTv<Y^i-^. 



The calculations work because the operator with 
density k{x,x') takes polynomials to polynomials. 
Our main results give classes of examples with the 
same explicit behavior. These include: 

• fe{x) is one of the exponential families singled 
out by Morris [86, 87] (binomial, Poisson, negative 
binomial, normal, gamma, hyperbolic) with tt{9) 
the conjugate prior. 

• feix) = g{x — 9) is a location family with vr(0) con- 
jugate and g belongs to one of the six exponential 
families above. 

Section 2 gives background. In Section 2.1 the 
Gibbs sampler is set up more carefully in both sys- 
tematic and random scan versions. Relevant Markov 
chain tools are collected in Section 2.2. Exponential 
families and conjugate priors are reviewed in Section 
2.3. The six families are described more carefully 
in Section 2.4 which calculates needed moments. A 
brief overview of orthogonal polynomials is in Sec- 
tion 2.5. 

Section 3 is the heart of the paper. It breaks the 
operator with kernel k{x,x') into two pieces: T : 
L^(m) L'^{t^) defined by 

Tg{0)= j fe{x)g{x)m{dx) 

and its adjoint T* . Then k is the kernel of T*T. Our 
analysis rests on a singular value decomposition of 
T. In our examples, T takes orthogonal polynomials 
for m[x) into orthogonal polynomials for ■k{9). This 
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leads to explicit computations and allows us to treat 
the random scan, x-cliain and ^-chain on an equal 
footing. 

The x-chains and ^-chains corresponding to three 
of the six classical exponential families are treated 
in Section 4. There are some surprises; while order 
n steps are required for the Beta/Binomial exam- 
ple above, for the parallel Poisson/Gamma example, 
logn + c steps are necessary and sufficient. The six 
location chains are treated in Section 5 where some 
standard queuing models emerge (e.g., the M/M/oo 
queue). The final section points to other examples 
with polynomial eigenfunctions and other methods 
for studying present examples. Our examples are 
just illustrative. It is easy to sample from any of 
the families f{x,0) directly. Further, we do not see 
how to carry our techniques over to higher compo- 
nent problems. We further point out that in routine 
Bayesian use of the Gibbs sampler, the distribution 
we wish to sample from is typically a posterior dis- 
tribution of the parameters. Nevertheless, our two 
component problems are easy-to-understand stan- 
dard statistical examples. 

Basic convergence properties of the Gibbs sampler 
can be found in [4, 102]. Explicit rates of conver- 
gence appear in [95, 96]. These lean on Harris recur- 
rence and require a drift condition of type 
E{V{Xi)\Xo = x)< aV{x) + b for ah x. Also re- 
quired are a minorization condition of the form 
k{x,x') > eq{x') for e > 0, some probability density 
q, and all x with V{x) < d. Here d is fixed with 
d>h/ {1 + a). Rosenthal [95] then gives explicit up- 
per bounds and shows these are sometimes practi- 
cally relevant for natural statistical examples. His 
paper [97] is a nice expository account. Finding use- 
ful V and q is currently a matter of art. For example, 
a group of graduate students tried to use these tech- 
niques in the Beta/Binomial example treated above 
and found it difficult to make choices giving useful 
results. This led to the present paper. 

A marvelous expository account of this set of tech- 
niques with many examples and an extensive liter- 
ature review is given by Jones and Hobert in [64]. 
In their main example an explicit eigenfunction was 
available for V] our Gamma/Gamma examples be- 
low generalize this. Further, "practically relevant" 
examples with useful analyses of Markov chains for 
stationary distributions which are difficult to sample 
directly from are in [65, 81, 98]. Some sharpenings 
of the Harris recurrence techniques are in [9] which 



also makes useful connections with classical renewal 
theory. 

The Markov chains studied in this paper generate 
stochastic processes which have the marginal as sta- 
tionary distribution. These chains have been used 
in a variety of applied modeling contexts in [80] and 
[89, 90, 91]. The present paper presents some new 
tools for these models. A related family of Markov 
chains and analyses have been introduced by Eaton 
[33, 34, 35] to study admissibility of formal Bayes 
estimators. The process is a two-component Gibbs 
sampler, as above, with vr an improper prior having 
an almost surely proper posterior. Under regular- 
ity assumptions, Eaton shows that recurrence of the 
^-chain implies the admissibility of certain formal 
Bayes estimators corresponding to vr. Hobert and 
Robert [61] have developed this research, introduc- 
ing the x-chain as a useful tool. Spectral techniques 
have proved to be a useful adjunct to Foster-type 
criteria for studying recurrence. We hope to develop 
our theory in these directions. 

The analyses carried out in this paper hinge criti- 
cally on the existence of all marginal and conditional 
moments. These moments need not exist. Indeed, 
consider the geometric density fe{x) = 9^{l — 9),0< 
X < oo, and put a Beta(a,/3) prior on 0, with a > 1 
being an integer. The marginal distribution of x ad- 
mits only a — 1 moments. For a > 1, the available 
moments are put to good use in [25]. 

2. BACKGROUND 

This section gives needed background. The two- 
component Gibbs sampler is defined more carefully 
in Section 2.1. Bounds on convergence using eigen- 
values are given in Section 2.2. Exponential families 
and conjugate priors are reviewed in Section 2.3. 
The six families with variance a quadratic function 
of the mean are treated in Section 2.4. Finally, a 
brief review of orthogonal polynomials is in Section 
2.5. 

2.1 Two-Component Gibbs Samplers 

Let be a measurable space equipped with 

a cr-finite measure Let (O,^) be a measurable 
space equipped with a probability measure 7r(d0). 
In many classical situations the prior is given by a 
density g{6) with respect to the Lebesgue measure 
d9, and so Tr{d6) = g{9) dO. However, we also con- 
sider examples where the parameter 6 is discrete, 
which cannot be described in the fashion mentioned 
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above. Thus, we work with general prior probabili- 
ties 7r{d9) throughout. Let {fe{x)}g^Q be a family of 
probability densities with respect to ^. These define 
a probability measure on X x Q via 



P{AxB)= / fe{x)^i{dx)7r{de), 
Jb J a 

AeT, Beg. 

The marginal density on X is 



m{x) 



e 



fe{x)7r{d9) 

m(x)fi{dx) 



so 



1 



We assume that < m{x) < oo for every x £ X. The 
posterior density with respect to Tr(d9) is given by 



TT 



\x) = fe{x)/m{x). 



The probability P splits with respect to m{dx) 
m{x)fi{dx) in the form 



P{A X B) 



TT 



AJB 



\x)7r {d6)m{dx), 

AeJ", Beg. 



The systematic scan Gibbs sampler for drawing 
from the distribution P proceeds as follows: 

• Starting from {x,9), first, draw x' from fg{-); sec- 
ond, draw 9' from tt{-\x'). 

The output is {x',9'). This generates a Markov 
chain (x, 9) — > (x' , 9') ^ ■ ■ ■ having kernel 

K{x,9;x',9') = fe{x')fe>{x')/m{x') 

with respect to iJ,{dx')7r{d9'). A slight variant ex- 
changes the order of the draws: 

• Starting from (x, 9), first, draw 9' from 7r(-|a;); sec- 
ond, draw x' from /e'(-). 



Note that J k{x,x')fi{dx') = 1 so that k{x,x') is a 
probability density with respect to /.i. Note further 
that m{x)k{x, x') = m{x')k{x' , x) so that the x-chain 
has m{dx) as a stationary distribution. 

The "0-chain" [from 0, draw x from fe{x) and 
then 9' from 7r(0'|x)] has transition density 



k{9,9') 



(2.2) 



X 



fo{x)7r{9'\x)fj.{dx) 
fe{x)fe'{x) 



X 



m{x) 



-fi{dx). 



Note that j k{9 ,9')'n{d9') = 1 and that k{9,9') has 
Tr{d9) as reversing measure. 

Example (Poisson/Exponential). Let X = {0, 1, 
2,3,...}, n{dx) = counting measure, = (O,cxo), 
fe{x) = e-^r/x!. Take 7T{d9) = e-^d9. Then m{x) = 

Jo x\ 



'd9: 



1/2^"'"^. The posterior density with 
respect to 7r{d9) is Tr{9\x) = fg{x)/m{x) = 2^~^^e~^9^ /x\. 
Finally, the x-chain has kernel 



k{x, x') 







X!X 




whereas the 0-chain has kernel 
, g {299'Y _ 

x=0 



d9 



< X, x' < oo. 



k{9,( 



2e 



(x!) 



2e" 



') 



with respect to 7r{d9), where Iq is the classical mod- 
ified Bessel function; see Feller [44], Section 2.7, for 
background. 

A second construction called the random scan chain 
is frequently used. From {x,9), pick a coordinate at 
random and update it from the appropriate condi- 



The output is (x', 9'). The corresponding Markov tional distribution. More formally, for g G ^^(P) 



chain (x, 9) — > (x', 0') — > • • • has kernel 

K{x,9;x',9') = fe>{x)fe'{x')/m{x) 

with respect to fi{dx')TT(d9'). Under mild condi- 
tions these two chains have stationary distribu- 
tion P. 

The "x-chain" [from x, draw 9' from ■7r{9'\x) and 
then x' from fg'{x')] has transition kernel 



Kg{x, 



(2.3) 



+ 



g{x,9')Tr{9'\x)TT{d9') 
9{x',9)fe{x')n{dx'). 



X 



k{x, x') 



(2.1) 



e 



T,{9\x)fe{x')TT{d9) 
fe{x)fe{x') 



e 



m(x) 



-7r{d9). 



We note three things. First, K sends L?'{P) —>■ 
L'^{P) and is reversible with respect to P. This is 
the usual reason for using random scans. Second, 
the right-hand side of (2.3) is the sum of a func- 
tion of X alone and a function of 9 alone. That is, 
K : L'^{P) L'^{m) + L'^{tt) [the range of K is con- 
tained in L'^{m) + L'^{tt)]. Third, if 5 G {L'^{m) + 
L^(7r))-'- [complement in L^(P)], then Kg = [Ker 
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K_-D {L^{m) + L^{tt))^]. Indeed, for any h G L^{P), 
{Kg,h)p = J{Kg)hdP = J{Kh)gdP = 0. Thus 
Kg = 0. We diagonalize random scan chains in Sec- 
tion 3. 

2.2 Bounds on Markov Chains 

2.2.1 General results. We briefly recah well-known 
results that will be applied to either our two-compo- 
nent Gibbs sampler chains or the x- and 0-chains. 
Suppose we are given a Markov chain described by 
its kernel K{(,,^') with respect to a measure (i{d^') 
[e.g., ^ = {x,9), = fi{dx)-K{d6) for the two-com- 

ponent sampler, =9, jl{d9) = 7r{d9) for the ^-chain, 
etc.]. Suppose further that the chain has stationary 
measure m{d^) =m{^)fl{d^) and write 

for the kernel and iterated kernel of the chain with 
respect to the stationary measure m{d^). We define 
the chi-square distance between the distribution of 
the chain started at ^ after I steps and its stationary 
measure by 



m') 



\KKW)-m{i')\'' 
m(f) 



This quantity always yields an upper bound on the 
total variation distance 



namely, 
(2.4) 



i j \kl{i') - l|m«) 



\K'{U')-m{0\m') 



< 



Our analysis will be based on eigenvalue decom- 
positions. Let us first assume that we are given a 
function (j) such that 

Km= I K{tc'm')m')=pm, 



H4>) = j m^midi) = 

for some (complex number) (3. In words, (/) is a gener- 
alized eigenfunction with eigenvalue (3. We say "gen- 
eralized" here because we have not assumed here 
that (j) belongs to a specific space [we only assume 



we can compute K(f) and m (</>)]. The second condi- 
tion [orthogonality to constants in L^(m)] will be 
automatically satisfied when < 1. Such an eigen- 
function yields a simple lower bound on the conver- 
gence of the chain to its stationary measure. 

Lemma 2.1. Referring to the notation above, as- 
sume that cj) £ (m{dS,)) and J dm = 1 . Then 

Moreover, if (j) is a bounded function, then 



\Kl - m\ 



> 



Proof. This follows from the well-known results 
(2.5) ,,2,,^_ _ n^^^r.^ _^.m2i 



x|( 



and 
(2.6) 



sup {\Kl{g)-m{gr} 

Ilsll2,m<l 



m||Tv = ^ sup {\Kl{g) - m{g)\}. 

ll9lioo<l 



Here, K^^{g) denotes the expectation of g under the 
density K^^, •). For chi-square, use g = (p as a test 



function. For total variation use 5 = 0/1101100 as a 
test function. More sophisticated lower bounds on 
total variation are based on the second moment method 
(e.g., [99, 106]). □ 

To obtain upper bounds on the chi-square dis- 
tance, we need much stronger hypotheses. Namely, 
assume that K is a self-adjoint operator on L'^{m) 
and that (m) admits an orthonormal basis of real 
eigenfunctions ipi with real eigenvalues /3j > 0, /3o = 
1, (fo = 1, A i so that 

Assume further that K acting on L^{m) is Hilbert- 
Schmidt (i.e., J2 lAI^ < oo). Then we have 

i^'(e,e') = Efe(0^.(e') 

i 

[convergence in Lp'[m x m)] 



and 
(2.7) 



x|(^) = E/3.'V?(0- 

j>0 



Useful references for this part of classical functional 
analysis are [1, 94]. 
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^-\x,z)^fe>{x')fi{dz). 
m[z) 



2.2.2 Application to the two- component Gibbs sam- For the variant K, the similar formula reads 

pier. All of the bounds in this paper are derived via _ 

the following route: bound by and use the ex- K {x,6;x ,9 ) = / k 
plicit knowledge of eigenvalues and eigenfunctions to 

bound the sum in (2.7). This, however, does not ap- These two bivariate chains have stationary density 

ply directly to the two-component Gibbs sampler K f^^J^ = ^^(^) ^^^h respect to the measure fi{dx) ■ 

(or K) because these chains are not reversible with ^' ^' °' write 



respect to their stationary measure. Fortunately, the 
x-chain and the 0-chain are reversible and their anal- 
ysis yields bounds on the two-component chain thanks 
to the following elementary observation. The x-chain 
has kernel k{x, x') with respect to the measure fi{dx). 
It will also be useful to have k{x,x') = k{x,x')/m{x'), 
the kernel with respect to the probability m{dx) = 
m{x)fi{dx). For I >2, we let k^{x') = k^{x,x') = 
J k{x,y)k^~^{y,x')fi{dy) denote the density [w.r.t. 
n{dx)] of the distribution of the x-chain after I steps 
and set fc^(x') = k^{x, x') = / k{x, y)k^~^{y, x')m{dy) 
[the density w.r.t. m{dx)]. Also 

i/p 



\\9\\p,P 



\g{u^)\^P{du] 



for p > 1. 



Lemma 2.2. Referring to the K,K two- 
component chains and x-chain introduced in Sec- 
tion 2.1, for any p £ [l,oo], we have 



mL/f)-i\ 



p,p 



< 



\k 



< sup \ \k' 



M\p,mf9iz)li{dz) 



IF 

llp,m 



and 



mio/f) 

Similarly, for the 9 -chain, we have 



IF 



mie/f) 



1 11^ < 
^\\p,p - 



\k 



e-1 



lF,M0\^Md9) 



< sup ||A:^ 



i-i 



IF 



and 



mie/f) 



^F < 



e-i 



IF 

llp,7r" 



Proof. We only prove the results involving the 
x-chain. The rest is similar. Recall that the bivariate 
chain has transition density 

K{x,9;x',9') = fgix')fe'ix')/m{x'). 

By direct computation 

K^{x,9;x',9') -- 



f,iz)k'-\z,x')^^Kdz). 



m{x') 



K\x,9;x',9') 
fix', 9') ^ 

\k^-\z,x') 



l)feiz)t,idz) 



and 



K^{x,9;x',9') 



1 



fix', 9') 

''-\x,z)-l)fe>iz)f,idz). 

To prove the second inequality in the lemma (the 
proof of the first is similar), write 

/f)-Mfp,P 



[Ki 



Ck'-Hx,z)-l)fe,iz)fi{dz] 
■ fe'ix')fiidx')7rid9') 



< 



k''-\x,z)-l\P 

• fe'iz)fJ-idz)fe'ix')fj,{dx')7r{d9') 
\k^-\x,z) -l\Pmiz)n{dz) 

\k^-\x,z) -l\Pmidz). 

This gives the desired bound. □ 

To get lower bounds, we observe the following. 

Lemma 2.3. Let g be a function of x only [abus- 
ing notation, g(x,9) = gix)]. Then 

Kgix,9)=fkix,x')gix'Mdx'). 

on of 9 only, then 
kie,9')gie')Tr{d9'). 
(x). Then 
g{x')nidx')Tr{d9') 



If instead, g is a function of 9 only, then 
Kg{x,9)-- 



Proof. Assume ^(x,^) = 5r(x). Then 
Kgix,e)- / / ^'(^)^'(^') 



m(x) 

k{x, x')gix')^idx') 
The other case is similar. □ 
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Lemma 2.4. Let xlei^) '^"■^ xleW be the chi- 
square distances after £ steps for the K-chain and 
the K-chain, respectively, starting at {x,9). LetXx{^)j 
Xg(^) be the chi- square distances for the x- chain 
{starting at x) and the 6-chain {starting at 9), re- 
spectively. Then we have 

xl{t)<xle{^)<xl{^-l). 

W^e ~ i||tv !^ W-^xfi ~ /IItv ^ ll^g ~ i||tv 

and 

xl{t)<xle{^)<xl{t-l), 

W^x ~ "T-IItv ^ ~ /IItv ^ ^ — ?7^||tv- 

Proof. This is immediate from (2.5)-(2.6) and 
Lemma 2.3. □ 

2.3 Exponential Families and Conjugate Priors 

Three topics are covered in this section: exponen- 
tial famihes, conjugate priors for exponential fami- 
lies and conjugate priors for location families. 

2.3.1 Exponential families. Let be a cj-finite 
measure on the Borel sets of the real line M. De- 
fine 6 = {6* e M : / e^^fi{dx) < oo}. Assume that 6 is 
nonempty and open. Holder's inequality shows that 
is an interval. For 9 £ Q, set 

M{9) = log J e^V('^3;), 

The family of probability densities {fg,9 E 0} is 
the exponential family through /i in its "natural 
parametrization." Allowable differentiations yield the 
mean m{9) = J xfg{x)fj,{dx) = M'{9) and the vari- 
ance a'^{9) = M"{9). 

Statisticians realized that many standard families 
can be put in such form so that properties can be 
studied in a unified way. Standard references for ex- 
ponential families include [8, 11, 66, 73, 74]. 

Example. Let ^ = {0, 1,2,3, .. .}, //(x) = 1/x!. 
Then 6 = M, and M{9) = e\ 

x9—e^ 

fe{x) = ^—, x = 0,l,2,.... 

This is the Poisson(A) distribution with A = e^. 

This paper works with familiar exponential fam- 
ilies. Many exotic families have been studied. See 
[12, 79]. These lead to interesting problems when 
studied in conjunction with the Gibbs sampler. 



2.3.2 Conjugate priors for exponential families. 
With notation as above, fix no > and xq in the 
interior of the convex hull of the support of /i. Define 
a prior density with respect to Lebesgue measure d9 
by 

7^no,xo{d9) = ^(no,xo)e'^°^«^-"«*-^(^) d9, 

where z{no,xo) is a normalizing constant shown to 
be positive and finite in Diaconis and Ylvisaker [29] 
which contains proofs of the assertions below. The 
posterior is 

■n:{d9\x) = ■n:no+i,{noxo+x)/{no+i){dG). 

Thus the family of conjugate priors is closed under 
sampling. This is sometimes used as the definition 
of conjugate prior. A central fact about conjugate 
priors is 

(2.8) E{m{9)\x) = ax + b. 

This linear expectation property characterizes con- 
jugate priors for families where /i has infinite sup- 
port. Section 3 shows that linear expectation implies 
that the associated chain defined at (2.1) always has 
an eigenfunction of the form x — c with eigenvalue 
a, and c equal to the mean of the marginal distribu- 
tion. 

Often, an exponential family is not parametrized 
by the natural parameter 9, but in terms of the mean 
parameter m{9). If we construct conjugate priors 
with respect to this parametrization, then (2.8) does 
not hold in general. However, for the six exponen- 
tial families having quadratic variance function (dis- 
cussed in Section 2.4 below), (2.8) holds even with 
the mean parametrization. In [19], this is shown to 
hold only for these six families. See [15, 55] for more 
on this. 

Example. For the Poisson example above the 
conjugate priors with respect to 9 are of the form 

z(no,xo)e"«^'o^-"«"''de. 

The mean parameter is A = e^. Since d9 = dX/ X, the 
priors transform to 

z(no,xo)A"«^'o-ie-"«^dA. 

The Poisson density parametrized by A is given 
by 

hix) = j , X = 0,l,2, .... 

x\ 
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Hence, the conjugate priors with respect to A are of 
the form 

S(no,2;o)A"o^°e-"°^dA. 

Here, the Jacobian of the transformation 9 — > m[9) 
blends in with the rest of the prior so that both 
parametrizations lead to the usual Gamma priors 
for the Poisson density, which satisfy (2.8). 

2.3.3 Conjugate priors for location families. Let 
^ be Lebesgue measure on M or counting measure 
on N. In this section we consider random variables of 
the form Y = 9 + with 9 having density tt{9) and 
e having density g[x) (both with respect to /i). This 
can also be written as [densities w.r.t. ^{dx) x ^{d9)] 

fe{x)=g{x-9), 

fix,9)=g{x- 9)71(9). 

In [30], a family of "conjugate priors" vr is suggested 
via posterior linearity. See [82] for further develop- 
ments. The idea is to use the following well-known 
fact: If X and Y are independent random variables 
with finite means and the same distribution, then 
E{X\X + Y) = {X + Y)/2. More generally, if and 
Xg are random variables which are independent with 
Xr (resp. Xg) having the distribution of the sum of 
r (resp. s) independent copies of the same random 
variable Z, then E{Xr\Xr + Xg) = :p^{Xr + Xs). 
Here r and s may be taken as any positive real num- 
bers if the underlying Z is infinitely divisible. 

With this notation, take g as the density for X^ 
and vr as the density for Xg and call these a conjugate 
location pair. Then the marginal density m{y) is the 
convolution of g and vr. 

Example. Let g{x) = e^^X^/x\ for x e X = {0,1, 
2, . . .}. Take 9 = A' and let tt{9) = e-'^rf I9\. Then 
mix) = e~('^+'')(A + r?)^/x! and 



tt{9\x) = 




0<9 <x <oo. 

The Gibbs sampler (bivariate chain K) for this ex- 
ample becomes: 

• From X, choose 9 from Binomial (x, ?7/(r/ -|- A)). 

• From 9, choose X = 9 + e with e ~ Poisson(A). 

The x-chain may be represented as Xn+i = Sx„ + 
En+i with Sk ~ Binomial ( /c, ?7/(?7 -|- A)) and 
e~ Poisson(A). This also represents the number of 
customers on service in an M/M/ oo queue observed 



at discrete times: If this is X„ at time n, then Sx„ is 
the number served in the next time period and Sn+i 
is the number of unserved new arrivals. The explicit 
diagonalization of the M/M/oo chain, in continuous 
time, using Charlier polynomials appears in [3]. 

This same chain has yet a different interpreta- 
tion: Let frjU) = (])p'il - pT'^. Here < p < 1 
is fixed and r] G {0,1,2,...} is a parameter. This 
model arises in underreporting problems where the 
true sample size is unknown. See [85]. Let r/ have a 
Poisson (A) prior. The Gibbs sampler for the bivari- 
ate distribution f{j,r]) = (^)p' {l—p)^~^ e~^X'^ /r]l goes 
as follows: 

• From 7], choose j from Bin(r/,p). 

• From j, choose r] = j + e with e ~ Poisson(A(l — 
P))- 

Up to a simple renaming of parameters, this is the 
same chain discussed above. Similar "translations" 
hold for any location problem where Tr{9\x) has 
bounded range. 

Note finally that there are natural statistical fam- 
ilies and priors not of exponential form where the 
analysis works out neatly. The hyper geometric dis- 
tribution for sampling from a finite population with 
a hypergeometric prior is developed in [28]. 

2.4 The Six Families 

Morris [86, 87] has characterized exponential fam- 
ilies where the variance o'^(0) is a quadratic func- 
tion of the mean: (t'^{9) = vq + vim{9) + V2m?'{9). 
These six families have been characterized earlier by 
Meixner [83] in the development of a unified theory 
of orthogonal polynomials via generating functions. 
In [56] the same families are characterized in a re- 
gression context: For Xi independent with a finite 
mean, X = j^Y.^i,Sn = ^^Hi^i - one has 

E{Sn\X = x) = a + bx + cx^ 

if and only if the distribution of Xi is one of the six 
families. In [39, 40], the six families are characterized 
by a link between orthogonal polynomials and mar- 
tingales whereas [41, 92] makes a direct link to Lie 
theory. Finally, Consonni and Veronese [19] find the 
same six families in their study of conjugate priors: 
The conjugate priors in the natural parametrization 
given above transform into the same family in the 
mean parametrization only for the six families. Wal- 
ter and Hamedani [105] construct orthogonal poly- 
nomials for the six families for use in empirical Bayes 
estimation. See also Pommeret [93]. 
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Extensions are developed by Letac and Mora [76] ^ / " \ A: 

and Casalis [15] who give excellent surveys of the Ex{9 ) = ( ^^qjy ) "'"Z^^i-^"'- 

literature. Still most useful, Morris [86, 87] gives a 

unified treatment of basic (and not so basic) prop- Negative Binomial: ^ = {0, 1, 2, . . .}, counting mea- 

erties such as moments, unbiased estimation, or- sure, = (0, 1): 

thogonal polynomials and statistical properties. We y{x + r) 

give the six families in their usual parametrization fe{x) = Y(^r)x\ ^^^^ ~ 

along with the conjugate prior and formulae for the 

moments Eg{X''), E^{e^) of X and 6 under dP = < 6* < l,r > 0. 

f0{x)fi{dx)Tr{d6), given the value of the other. For r{a + (3) 

each of these families, Ee{X'') and E^{e^) are poly- ^(^^) = T{a)T{P) ~ 

nomials of degree A; in ^ and x, respectively. We only 

specify the leading coefficients of these polynomials < a, (3 < oo. 

below. In fact, the leading coefficients are all we re- / 6 \^ '^"^ / ft ^ 

quire for our analysis. In the conditional expectation Eg{X ) = (r)^ f — - \ + ^ aj { — 

formulae, k is an integer running from to oo unless i=0 
specified otherwise. For a G M and n E NU {0}, we / / a \k\ k-i 

define ( ( j^^j j = (P + r - k)kx'' + J2 

{a)n = a(a + 1) • • • (a + n - 1) ^■=° 



if n > 1, (a)o = 1. 



k<f3 + r. 

r 1 r 11 ■ 11- Normal: = = M, u Lebesgue measure: 

While some of the following calculations are stan- ^ 

dard and well known, others are not, and since the fe{x) = —j= I" ^ 

details enter our main theorems, we give complete \/27ro^ 

statements. < cj^ < oo 

Binomial: X = |0, . . . , n|, n counting measure, G = i 

(0, 1): ^{dO) = -yL= e(-V2)(^-)V^^ dO, 



n 



V27rT2 



/g(x)= ( ^ 0<^<1, -oo<u<oo, 0<r<oo, 



fc-i 



7r(d^) 



r(a)r(/3) j=o 



= (n - A; + + ^ a,#, 



0<a,/3<oo, _ / ^2 ^ fc _ 

t2 + 



^^(0*^) = - — - — -x^ + y 6, 



j=o Gamma: = = (0, oo), /i Lebesgue measure: 

0<A:<n, , ^ ^a-i^-x/e 



^ ' J=o TT{d6) = — — dO, 0<6,c<oo, 

Poisson: X = {0, 1,2,.. .}, /i counting measure, G = 

(0,oo): EeiX'') = {a)ke\ 

q-Sqx k-1 

^(^) = — 0<^<oO' i?,(0'=) = (a + 6-A:)fc2;^ + 5^6jX^ 



j=0 



Qa—l^—B/a 

■Tr{d9) = d6, 0<a,a<oo, 0<k<a + b. 



T{a)a<^ 



k-l 



Hyperbolic: A' = G = M, ^ Lebesgue measure: 

r-2 



i=0 ' ' ' 7rr(l + e2)r/2 
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r irx r irx 



where f5{a,b) 



2 ' 2 2 , 

r >0, 

mm 



r(a + 6) ' 



7r{de) 



r{p/2 - p5i/2)T{p/2 + p5i/2) 

r(p/2)r(p/2-i/2)0F 
■ de, 



(1 + 02)^/2 



-oo < 6 < oo, p>l, 



k-l 
j=0 



1 



r^{r + p — k 



k-l 

—x^ + Y^b^x^, 

< k <r + p - 1. 



A unified way to prove the formulas involving 
EeiX'') follows from Morris [87], (3.4). This says, 
for any of the six families with 171(6) the mean pa- 
rameter and pk{x,mQ) the monic, orthogonal poly- 
nomials associated to the parameter 6q, 

Eg{pk{x,mo)) = bk{m{e) - m(6'o))^, 

where, if the family has variance function cr'^{9) = 
V2'm?{6) + vim{6) + vo, 

k-l 

bk = l[il + iv2). 

For example, for the Binomial(n, 9) family, m{6) = 
n9, a'^{9) = n9{\ — 9) , so V2 = — 1/n and 



'k-l 



Eo{pk{x,mo)) = < n( 



n — I] 



. i=0 



Comparing lead terms and using induction gives the 
first binomial entry. The rest are similar; the values 
of V2 are U2(Poisson) = 0, t'2(NB) = 1/r, 
f 2 (Normal) = 0, 112 (Gamma) = l/r, f 2 (Hyperbolic) = 
1. Presumably, there is a unified way to get the 
Ex{9^) entries, perhaps using [87], Theorem 5.4. This 
result shows that we get polynomials in x but the 
lead coefficients do not come out as easily. At any 
rate, they all follow from elementary computations. 

Remarks. 1. The moment calculations above are 
transformed into a singular value decomposition and 



an explicit diagonalization of the univariate chains 
(x-chain, 0-chain) in Section 3. 

2. The first five families are very familiar, the sixth 
family less so. As one motivation, consider the gen- 
eralized arc sine densities 



fe{y) 



y 



Hi 



T{9)T{1 



0<y, 9<l. 



Transform these to an exponential family via x 
log(?//(l — y)), 7] = ^9 — 7r/2. This has density 



^xr}+log{cosr}) 

2cosh((7r/2)x)' 

vr vr 

— 00 < < 00, <r7<— . 

2 ' 2 



The appearance of cosh explains the name hyper- 
bolic. This density appears in [44], page 503, as 
an example of a density which is its own Fourier 
transform (like the normal) . Many further references 
are in [37, 86, 87]. In particular, go{x) is the den- 
sity of ^ log |C| with C standard Cauchy. The mean 
of grj{x) is tan(r/) = 9. Parametrizing by the mean 
leads to the density shown with r = 1. The aver- 
age of r independent copies of independent vari- 
ates with this density gives the density with general 
r. The beta function is defined as usual; (3{a, b) = 
r(a)r(6)/r(a + 6). 

The conjugate prior for the mean parameter is of 
Pearson Type IV. When 6 = this is a rescaled t 
density. For general 6 the family is called the skew t 
in [37] which contains a wealth of information. Un- 
der the prior, the parameter 9 has mean pS/{p — 2) 
and satisfies 

{p-{k + 2))E{9''+^) = kE{9^-^) + p5E{9^), 

l<k<p-2. 

This makes it simple to compute the Ex{9^) entry. 
Moments past p are infinite. 

The marginal distribution m(x) can be computed 
in closed form. Using Stirling's formula in the form 
|r((T + it)\ ~ ^/2^: e^''l*l/2|t|'^-i/2 as \t\ ] 00, shows 
that m{x) has tails asymptotic to c/xf. It thus has 
only finitely many moments, so the x-chain must be 
studied by nonspectral methods. Of course, the ad- 
ditive version of our setup has moments of all order. 
The relevant orthogonal polynomials are Meixner- 
Pollaczek. 
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2.5 Some Background on Orthogonal 
Polynomials 

A variety of orthogonal polynomials are used cru- 
cially in the following sections. While we usually just 
quote what we need from the extensive literature, 
this section describes a simple example. Perhaps the 
best introduction is in [18]. We will make frequent 
reference to [63] which is thorough and up-to-date. 
The classical account [100] contains much that is 
hard to find elsewhere. The on-line account [68] is 
very useful. For pointers to the literature on orthog- 
onal polynomials and birth and death chains, see, 
for example, [104]. 

As an indication of what we need, consider the 
Beta/Binomial example with a general Beta(a,/3) 
prior. Then the stationary distribution for the x- 
chain on X = {0, 1, 2, . . . , n} is 



m{x) 



where 



r(a + x) . . . 
■ = \ =a a + 1 ••• a + : 
r(a) 



11 



The choice a = (3 = 1 yields the uniform distribution 
while a = 13 = 1/2 yields the discrete arc-sine density 
from [43], Chapter 3, 



m{x) 



/2n—2x\ 
\ n—x J 



22r. 



The orthogonal polynomials for m are called Hahn 
polynomials [see (2.10) below]. They are developed 
in [63], Section 6.2, which refers to the very useful 
treatment of Karlin and McGregor [67]. The poly- 
nomials are given explicitly in [63], pages 178-179. 
Shifting parameters by 1 to make the classical no- 
tation match present notation, the orthogonal poly- 
nomials are 



Qj{x) 



+ a + /3- 
a, —n 



1, 



< i < n. 



Here 



(2.9) 



( ai ■ 








■bs 





E 

1=0 



{aia2 - ■ ■ ar)e 



with (oi • • • ar)e = Y\.( 



ai)e. 



These polynomials satisfy 

r (n n\ (/9)i(« + /3 + J-l)n+i 

(2.10) 



(a + /3 + 2j-l) 



'ji- 



Thus they are orthogonal polynomials for m. When 
a = (3 = 1, these become the discrete Chebyshev 
polynomials cited in Proposition 1.1. From our work 
in Section 2.2, we see we only need to know Qj{xQ) 
with xq the starting position. This is often available 
in closed form for special values, for example, for 
xo = and xq = n, 



(2.11) 



Q,(0) = 1, 

(a + l)j 



i=l 



For general starting values, one may draw on the 
extensive work on uniform asymptotics; see, for ex- 
ample, [100], Chapter 8, or [5]. 

We note that [86], Section 8, gives an elegant self- 
contained development of orthogonal polynomials 
for the six families. Briefly, if feix) = e^^~^^(^) is 
the density, then 

Pfc(x,^)=a2'=|^/,(x)|//,(x) 

[derivatives with respect to the mean m{9)]. lia'^{9) = 
V2'nn?{0) + vim{9) + vq, then 

k~l 

EeipnPk) = SnkakO-'^^ with a/c = A;! J| (1 iv2). 

We also find need for orthogonal polynomials for 
the conjugate priors '7t{9). 

3. A SINGULAR VALUE DECOMPOSITION 

The results of this section show that many of the 
Gibbs sampler Markov chains associated to the six 
families have polynomial eigenvectors, with explic- 
itly known eigenvalues. This includes the x-chain, 9- 
chain and the random scan chain. Analysis of these 
chains is in Sections 4 and 5. For a discussion of 
Markov operators related to orthogonal polynomi- 
als, see, for example, [6]. For closely related statis- 
tical literature, see [13] and the references therein. 

Throughout, notation is as in Section 2.1. We have 
{fe{x)}eee a family of probability densities on the 
real line M with respect to a cr-finite measure n{dx), 
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for ^ G G C M. Further, Tr{d6) is a probability mea- 
sure on 0. These define a joint probability P on 
M X with marginal density m{x) [w.r.t. fi{dx)] and 
posterior density [w.r.t. the prior TT{d9)] given by 
7r{9\x) = /e(x)/m(x). The densities do not have to 
come from exponential families in this section. 

Let c = #suppm(x) . This may be finite or infinite. 
For simplicity, throughout this section, we assume 
supp(7r) is infinite. Moreover, we make the following 
hypotheses: 

(HI) For some ai,a2 > 0, / e"il^l+"2l«lp(dx, 
d6) < oo. 

(H2) For 0<k<c, Eo{X^) is a polynomial in 
of degree k with lead coefficient r]k > 0. 

(H3) For < A; < oo, Ex{0^) is a polynomial in x 
of degree k with lead coefficient //fc > 0. 

By (HI), L^{m{dx)) admits a unique monic, or- 
thogonal basis of polynomials pk, < k < c, with pk 
of degree k. Also, L'^{7r{d6)) admits a unique monic, 
orthogonal basis of polynomials q^, < k < oo, with 
Qk of degree k. As usual, tjq = ^q = 1 and po = qq = 1. 

Theorem 3.1. Assume (H1)-(H3). Then: 

(a) The x-chain (2.1) has eigenvalues f3k = i]kfJ'k 
with eigenvectors pk, 0< k < c. 

(b) The 6-chain (2.2) has eigenvalues (3k = rjk^k 
with eigenvectors qk for < k < c, and eigenvalues 
zero with eigenvectors qk for c<k<oo. 

(c) The random scan chain (2.3) has spectral de- 
composition given by 



eigenvalues | it \^/rikJIk, 

eigenvectors Pk{x) ± . —qf;{6), 
V ^J'k 



0<k<c, 



eigenvalues^, eigenvectors qk, c<k<co. 

The proof of Theorem 3.1 is in the Appendix. 

Remark. The theorem holds with obvious mod- 
ification if ^supp(7r) < oo. This occurs for binomial 
location problems. It will be used without further 
comment in Section 5. Further, the arguments work 
to give some eigenvalues with polynomial eigenvec- 
tors when only finitely many moments are finite. 

4. EXPONENTIAL FAMILY EXAMPLES 

This section carries out the analysis of the x- and 
9- chains for the Beta/Binomial, Poisson/Gamma 
and normal families. The x- and ^-chains for the 
normal family are essentially the same. Hence, this 



section consists of five examples in all. For each, 
we set up the results for general parameter values 
and carry out the bounds in some natural special 
cases. The other three families are not amenable to 
this analysis due to lack of existence of all moments 
[which violates hypothesis (HI) in Section 3]. How- 
ever, they are analyzed by probabilistic techniques 
such as coupling in [25]. 

4.1 Beta/Binomial 

4.1.1 The x-chain for the Beta/Binomial. Fix a, 
/? > 0. On the state space X = {0,1,2, ... ,n}, let 



k{x, x') 



\x 



^ 1 aoi+x+x'—l/ 



\l3+2n~{x+x')~l 



(4.1) 



T{a + P + n)de 
T{a + x)T{(3 + n-x) 

n \ T{a + 13 + n)T{a + x + x') 
x' 



T{a + x)T{(3 + n-x) 

V{l3 + 2n- {x + x')) 
T{a + (3 + 2n) 

When a = (3 =1 (uniform prior), k{x, x') is given by 
(1.1). For general a, (3, the stationary distribution is 
the Beta/Binomial: 



m(x) 



n\ {a)x{fi)r 



(a + /3)n 



where 



From our work in previous sections we obtain the 
following result. 

Proposition 4.1. For n = 1,2,..., and a, 

/3 > 0, the Beta/Binomial x-chain (4.1) has: 

(a) Eigenvalues /3o = 1 o.i^d (3j = "^"^J"^'^^^'^"^^'' , 1 < 
j <n. 

(b) Eigenf unctions Qj, < j < n, the Hahn polyno- 
mials of Section 2.5. 

(c) For any i>l and any starting state x, 



where Zi 



(a + ^ + 2i-\){a + ^)n{,a)i (n 



(/3),(a + /3 + i-l) 



n+l 
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We now specialize this to a = /? = 1 and prove the 
bounds announced in Proposition 1.1. 

Proof of Proposition 1.1. From (a), Pi = 
"^"-^^-("-^+^) . From (2.11), Q^in) = 1. By ele- 



(n+2){n+3)-(n+j+l) 

mentary manipulations, Zi = [3i{2i + 1). Thus 



{k^{n,y) -m{y)f 



y=0 

n 



m{y) 

i=l 

We may bound Pi< Pl = {1 — ;^)% and so 

n n 



4 = 1 



i=l 



Using J2Tx' = x/{l-x), = x/(l we 

obtain 



21+1 



(i-/3r^) 



By Lemma 2.4, this gives (for the K chain) 



3/3f +^ < Xieii) < 



21-1 



{1-pr-Y 



The upper bound in total variation follows from 
(2.4). For a lower bound in total variation, use the 
eigenfunction ^i{x) = x — ^. This is maximized at 
X = n and the lower bound follows from Lemma 2.1. 
□ 

Remark. Essentially, the same results hold for 
any Beta(a, f3) prior in the sense that, for fixed a, (3, 
starting at n, order n steps are necessary and suf- 
ficient for convergence. The computation gets more 
involved if one starts from a different point than 
n. Mizan Rahman and Mourad Ismail have shown 
us how to evaluate the Hahn polynomials at n/2 
when n is even and a = (3 using [63], (1.4.12) (the 
odd-degree Hahn polynomials vanish at n/2 and the 
three-term recurrence then easily yields the values 
of the even-degree Hahn polynomials at n/2). See 
Section 5.1 for a closely related example. 

4.1.2 The 9-chain for the Beta/Binomial. Fix a, 
/9 > 0. On the state space [0, 1], let 



k{e, 



E 

j=0 



n 



6^(1-6) 



n-j 



(4.2) 



T{a + f3 + n) 



T{a + j)T{p + n-j) 



This is a transition density with respect to Lebesgue 
measure dO' on [0, 1]. It has stationary density 



no— 1 



T{a)T{f3) 

Remark. In this specific example, the prior ii{d6) 
has a density g{e) = ^^fSy^""^! - with re- 

spect to the Lebesgue measure dO. For ease of expo- 
sition, we deviate from the general treatment in Sec- 
tion 2.1, where k{6,9') is a transition density with 
respect to ir{d9'). Instead, we absorb g{e') in the 
transition density, so that k{9, 6') is a transition den- 
sity with respect to the Lebesgue measure d9' . 

The relevant orthogonal polynomials are Jacobi 
polynomials P"'^, a = a — 1, 6 = /5 — 1, given on 
[—1, 1] in standard literature [68], 1.8. We make the 
change of variables 9 = and write Pi{9) = 



pa-l,P-l 
i 

(4.3) 
where 



(1 — 29). Then, we have 



Pj{9)pk{9)7r{9)d9 = z-^Sjk, 



^ (2j + g + /3 - l)r(a)r(/3)r(j +a + (3- l)j! 
ria + (3)rij + a)T{j + l3) 

Proposition 4.2. For a, /? > 0, the 9-chain for 
the Beta/Binomial (4.2) has: 

(a) Eigenvalues Pq = 1, Pj = "^"[„^|'^^„"^^"^^^ , 1 < 
j < n, Pj = for j > n. 

(b) Eigenf unctions pj, the shifted Jacobi polyno- 
mials. 

(c) With Zi from (4.3), for any i > 1 and any 
starting state 9 G [0, 1], 

n 

x^w=E/3fffw^^- 

i=l 

The following proposition gives sharp chi-square 
bounds, uniformly over a,P,n in two cases: (i) a > 
P, starting from (worst starting point), (ii) a = 
P, starting from 1/2 (heuristically, the most favor- 
able starting point). The restriction a > /? is not re- 
ally a restriction because of the symmetry P^'^{x) = 
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{-lyP-'^i-x). For a > /3 > 1/2, it is known (e.g., u 4) . 2i + a + /? + l^ + a + /3-li + « 

[63], Lemma 4.2.1) that 2i + a + /3-l i + 1 i + [3 

sup ipjI — sup |J< \—Pi[U) — —-. -g /3 + 1 

[0,1] [-1,1] ^! p + i 

Hence, is clearly the worst starting point from . fi a + P + 

the viewpoint of convergence in chi-square distance, ^ a + (3 + n + 1 , 

that is, The lead term in Xo(^) 

sup {xm = xli^). na + P + l)a\ 

e6[o,i] V /3 / 

Proposition 4.3. For a> (3 > 0, n> 0, set N = Prom (4.4), we get that for any 

log[(a + /3)(a + l)/(/3 + l^ The e-cha^n for the £ > — 1— log[(a + /3)(a + l)/(/3 + 1)] 

Beta/Binomial {A.2) satisfies: ~ -2\og(3i ' )\ 

(i) . < 7e-, for £ > c> 0. we have 

. x§(^)>le^/or£<^^,c>0. ^'^'Tfl < 5/6. 

(ii) Assuming a = P > 0, Pi Pii^J 

2 fe\^,oo2e f ff-^ 1 Hence, for such ^, 



Roughly speaking, part (i) says that, starting from 
0, £{a,P,n) steps are necessary and sufficient for _ g/ (" + + 

convergence in chi-snuare distance where \ P J 



Pi 



21 



\og\{a + P){a + 1) /{P + 1)1 With AT = log[(a + /3)(a + 1)/(/3+ 1)] as in the propo- 

£(a,P,n) = — ^ 7 -r-r, 7; — ^r- sition, we obtain 

^ ^ -21og(l-(a + /3)/(a + /3 + n)) 

Note that if a, n, n/a tend to infinity and P is fixed, ^o(^) " for ^ > _2 , c > 0; 

If a,n,n/a tend to infinity and a = /?, Proof of Proposition 4.3(n). When a = b, 

nloga 2a classical Jacobi polynomial P^'^ is given by 

£{a,a,n)r , /3i ~ 1 . ^nM^\ 

4a n Pn-) = W^Ct^'^'i^) 

The result also says that, starting from 0, conver- l^*^ + ^)k 

gence occurs abruptly (i.e., with cutoff) at £{a,P,n) where the C^'s are the ultraspherical polynomials. 

as long as a tends to infinity. See [63], (4.5.1). Now, [63], (4.5.16) gives C^(0) = 

Part (ii) indicates a completely different behav- if n is odd and 

ior starting from 1/2 (in the case a = P). There is ^ (2i^)„(— 1)"/^ 

no cutoff and convergence occurs at the exponen- C'„(0) — 2n(^^y'2)!(i/ + l/2)„/2 

tial rate given by P2 (/32 ~ 1 - ^ ff n/a tends to . u ^ ^- ^-x. \.-(^ a i u-' +v,- 

. r. \ if n IS even. Going back to the shifted Jacobi s, this 

yields p2fc+i(l/2) = and 

Proof of Proposition 4.3(i). We have xni^) = (rv^^, 1 /o 

E?/?f P.(0)^.. and P2.{l/2) = j0^C-''\^) 

PlilP^+M^^^+l (a)2fc (2a-l)2fc(-l)^ 

Pf%{Qfzi (2a-l)2fc 22fc/t!(a)fc 

V' _ {a + kU-lf 



a + P + n + ij 2'^^k\ 
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We want to estimate 

Ln/2J 
1 

and thus we compute 

^2fi+l)^'2{j+l) (l/2)^^2{i+l) 
fe(l/2)2z2. 

{n-2i){n-2i-\) 
(2a + n + 2i)(2Q + n + 2i + 1) 
4z + 2a + 1 2z + 2a - 1 
4i + 2a - 1 2i + 2a + 1 
2i(2i + l)(2a + 2i + l)(2a + 2i) 
(2i + a)2(2f + a + l)2 

(a + 2i)(a + 2i + l)^ ^ 
4(a + i)(i + l) 



The orthogonal polynomials for the negative bino- 
mial are Meixner polynomials [68], (1.9): 



Mj{x) = 2Fx 



-J -X 
a 



a 



21 



These satisfy [68], (1.92), 



x=0 



{a)j 



(4.5) 



Our work in previous sections, together with basic 
properties of Meixner polynomials, gives the follow- 
ing propositions. 

Proposition 4.4. For a,a > the Poisson/ 
Gamma x-chain (4.6) has: 

(a) Eigenvalues (3j = (a/(l + a)y , < j < oo. 

(b) Eig enf unctions Mj{x), the Meixner polynomi- 



< Iff- 



Hence 



als. 



(c) For any ^ > and any starting state x 
{k^{x,y) - m{y)f 



Xv2(^)<10/32>2 (1/2)^^2 for£> 



-21og/?2 



X^(^) = E 

y=0 



m{y) 



As 



, , , a + 1 , 4(2a + 3) 

P2(l/2) = ^— and Z2- ^ ' 



(a)i 



4 '^'^ a(a + l)2' 

this gives xl/2i^) > \02 and, assuming i > mHp^' 

4.2 Poisson/Gamma 



i=i {l + a)HV 

Proposition 4.5. For a = a = l, starting at n, 
Xn W < 2"'^ for i = log2(l + n) + c, c> 0; 
X^(£) > 2^- /or i = log2(n - 1) - c, c> 0. 
Proof. From the definitions, for all j and pos- 



4.2.1 The x-chain for the Poisson/Gamma. Fix itive integer 
a,a>0. For x,y e X = {0,1,2, ...} = N, let 

foo g— A{o+l)/a^a+x-l 



k{x,y) 



\Mj{x)\ 



(4.6) 



r(a + x)(a/(a-|-l))°+^ 



j=0 



E(-i)M' U(x-i)---(^-^+i) 



i=0 



<E •^^^ = (l + ^)'■• 



^(a+x+y)(a/(2a+l))''+^+^^ Thus, for £>log2(l + n) + c. 



r(a + x)(a/(a -M))«+^y! 
The stationary distribution is the negative binomial 



An 



m{x) 



ia).f 1 



x! \a + l/ \a + l. 
When a = a = 1, the prior is a standard exponential, 
an example given in Section 2.1. Then, 



k{x,y) 



y 



2, 



(£) = ^M|(n)2-^'(2^+i) 
i=i 

oo 

<^(l+„)2i2-^(2^+l) 

(l + n)22-(2W) 
- l-(l + n)22-(2^+i) 



-2c-l 



m(x) = 1/2' 



z+l 



< 



1 - 2-2C-1 



< 2 



-2c 
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The lower bound follows from using only the lead 
term. Namely, 



/\ (a-l)/2 



1 



>2^'= for £ = log2(n- 1) -c. 



□ 



a/{l+a) \ e 
^-e-(a+i)e'/a 

a/ (1 + a) 



Remark. Note the contrast with the Beta/ 
Binomial example above. There, order n steps are 
necessary and sufficient starting from n and there 
is no cutoff. Here, log2 n steps are necessary and 
sufficient and there is a cutoff. See [27] for further 
discussion of cutoffs. 

Jim Hobert has shown us a neat appearance of 
the x-chain as branching process with immigration. 
Write the transition density above as 



k{x,y) 



r(a + X + y) f a+1 



T{a + x)y\ \2a + l 



a 



2a + 1 



■{2^ {a + 1)96' /a). 

Thus, k{6,9') is a transition density with respect to 
the Lebesgue measure dO', that is, 



k{e,e')de' = 1. 



Here Ia~i is the modified Bessell function. For fixed 
9, k{9,9') integrates to 1 as discussed in [44], pages 
58-59. The stationary distribution of this Markov 
chain is the Gamma: 



This is a negative binomial mass function with 



^{d9) 



parameters 9 



2a+l 



and r = X + a. Hence, the x- 



-e/aga-l 



■d9. 



chain can be viewed as a branching process with 
immigration. Specifically, given that the population 
size at generation n is X„ = x, we have 



To simplify notation, we take a = 1 for the rest of 
this section. The relevant polynomials are the La- 
guerre polynomials ([68], Section 1.11) 



X 



n+l 



X 

E 

1=1 



Ni.n + M, 



n+l) 

are i.i.d. negative binomi- 
" and r = 1, and M^+i, 



a 



where iVi,n,X2,„, . . . ,iVx 
als with parameters 9 = 2a+i 
which is independent of the Ni^^s, has a negative 
binomial distribution with parameters 9 = 2a+i 
r = a. The Ni^nS represent the number of offspring 
of the nth generation and M„+i represents the im- 
migration. This branching process representation was 
used in [61] to study admissibility. 

4.2.2 The 9-chain for the Poisson/Gamma. Fix 
Q, a > 0. For 0, 0' G e = (0, cx)) , let r/ = (a + l)6l7a. 
Note that TT{d9') = g{9')d9\ where g{9') = 

— \ I — . Hence, as in Section 4.1.2, we absorb 
g{9') in the transition density of the 0-chain. This 
gives 

g-0'(a+l)/a(-5)/)a+j-l 



Note that classical notation has the parameter a 
shifted by 1 whereas we have labeled things to mesh 
with standard statistical notation. The orthogonal- 
ity relation is 

Via A- i\ 

U{9)Lm<0)d9=^^5., 

The multilinear generating function formula ([63], 
Theorem 4.7.5) gives 



k{9, 



oo 

i=o 



j! T{a + j){a/{a + l)Y+^ 



Y.Li{9fzif 

i=0 



1 



{l-tY ^j\{a),\l-i 



a/(a + l) 



E 



j!r(a + i) 



a/{a + l) \ 9 



(a-l)/2_oo 



(4.7) 



E- 

i=o 



9/\2j+a-l 



j!r(a + i) 



Combining results, we obtain the following state- 
ments. 

Proposition 4.6. For a = l and a>0, the Markov 
chain with kernel (4.7) has: 

(a) Eigenvalues jSj = —,0 < j < oo. 

(b) Eigenfunctions Lj, the Laguerre polynomials. 
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(c) For any £>1 and any starting state 9, 
^ r(a + jj 



-2£+ie 



/(l-2-2^) 



(l_2-2£)a 

■^j!(a)Al-2-4V 

Proposition 4.7. For a = 1 and a > 0, ^/le 
Markov chain with kernel (4.7) satisfies: 

• For > 0, < e^2-^ if £ > i(log2[2(l + a + 
^Va)] +c), OO. 

• For e £ (0,a/2) U (2a, oo), X0(^) > 2^= i/ ^ < 
i(log2[i(^Va + a)] -c), OO. 

Proof. For the upper bound, assuming i > 1, 
we write 

;t2(^) = (l_4-^)-ag-(2e4-0/(l-4-^) 



1 



< 



„ j!(a)Al-4- 
exp((20V«)4-O 



(l_4-^)a 



< 2(6'Va + a)4" 



exp(2(^Va)4^ 



(l_4-^)a+i 

For i(log2 [2(1 + 192/0 + a)] + c), c> 0, we obtain 
X^(^)<e22-^ 

The stated lower bound does not easily follow 
from the formula we just used for the upper bound. 
Instead, we simply use the first term in Xe(^) = 
E,>i/3fF2(e)^, that is, a-He-afi-'. This 
easily gives the desired result. □ 

Remark. It is not easy to obtain sharp formulas 
starting from 9 near a. For example, starting at 9 = 
a, one gets a lower bound by using the second term 
xlW = Ej>i PfL]{9)f^^ (the first term vanishes 
at 6* = a). This gives xl{^) > [2a/(a + l)]4-2^ When 
a is large, this is significantly smaller than the upper 
bound proved above. 

4.3 The Gaussian Case 

Here, the x-chain and the 0-chain are essentially 
the same and indeed the same as the chain for the 
additive models, so we just tre at the x-chain. Let 

;f = M, fe{x) = e-i/2(^-^)'/^'/y2^ and T^{d9) = 



\/27rr2 



■ d9. The marginal density is Normal (u, 



A stochastic description of the chain is 



with a ■■ 



Normal — rr,o' 

Vcj2 + t2' 



(72 + t2 ' 

This is the basic autoregressive (ARl) process. Feller 
([44], pages 97-99) describes it as the discrete-time 
Ornstein-Uhlenbeck process. The diagonalization of 
this Gaussian Markov chain has been derived by 
other authors in various contexts. Goodman and 
Sokal [50] give an explicit diagonalization of vector- 
valued Gaussian autoregressive processes which spe- 
cialize to (a), (b) below. Donoho and Johnstone ([31], 
Lemma 2.1) also specialize to (a), (b) below. Both 
sets of authors give further references. Since it is so 
well studied, we will be brief and treat the special 
case with i^ = 0, (T2 + r2 = l/2. Thus the stationary 
distribution is Normal(0, 1/2). The orthogonal poly- 
nomials are now Hermite polynomials ([68], 1.13). 
These are given by 



Bn{y) = (2y)"2Fo 



-n/2,-(n- l)/2 



1 



n-2k 



{-in2y) 
■ k\{n-2k)\ 



They satisfy 

1 2 
-= \ e-y H„,{y)Hn{y) dy = 2"n!5„„. 

V^r J-00 



There is also a multilinear generating function for- 
mula which gives ([63], Example 4.7.3) 

2xH ' 



„ 2"n! 



1 



exp 



l + t 



Proposition 4.8. For 1/ = 0, 0-2 + r2 = 1/2, the 
Markov chain (4.8) has: 

(a) Eigenvalues (3j = (2r2)-5'' (as cj2 + r2 = 1/2, we 
have 2t2 < 1). 

(b) Eigenf unctions the Hermite polynomials Hj. 

(c) For any starting state x and all i> 1, 

1 



k=l 



exp(2x2(2r2)27(l + (2r2)2«)) 



2\2e\ 



1 - (2r2) 



4£ 
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The next proposition turns the available chi-square 
formula into sharp estimates when x is away from 0. 
Starting from 0, the formula gives Xo(^) ~ 
(1 — {2t'^)'^^)~^/'^ — 1. This shows convergence at the 
faster exponential rate of = (2r^)^ instead of j3i = 
2t\ 

Proposition 4.9. Foru = 0, a'^ + r'^ = 1/2, x e 
M, the Markov chain (4.8) satisfies: 

Xxl«J^»e jori^ -21og(2r2) ' ' 

log(2(l + x2)) -c 



2(l + a;2) 



fore< 



-21og(2r2) ' 

OO, 

Proof. For the upper bound, assuming 



> 



1 



-(log(2(l+x2))+c), OO, 



-21og(2r2) 
we have 

(2r2)2^ < 1/2, 2x2(2r2)2^ < 1 
and it follows that 



exp(2x2(2 r2)27(l + (2r2)2^)) 
/l _ (2r2)4^ 



1 



<(l+2(2r2)4^)(l + 6x2(2r2)2^)-l 
<8(l + x2)(2r2)2^ 

For the lower bound, write 



exp(2a;2(2 r2)27(l + (2r2)2^)) 
/l _ (2r2)4^ 



1 



>exp(x2(2r2)27-l 
>x2(2r2)2^ 

5. LOCATION FAMILIES EXAMPLES 



□ 



Sharp rates of convergence using spectral tech- 
niques are not restricted to exponential families and 
conjugate priors. In this section, we show how simi- 
lar analyses are available for location families. Very 
different examples are collected in Section 6. In this 
section fe{x) = g{x — 6) with g and vr members of 
one of the six families of Section 2.4. To picture the 
associated Markov chains it is helpful to begin with 
the representation X = 9 -\- e. Here is distributed 
as TT and e is distributed as g. The x-chain goes as 



follows: from x, draw 9' from '/r(-|x) and then go to 
X' = 9' -\- e' with e' independently drawn from g. 
It has stationary distribution ra{x) dx, the convolu- 
tion of TT and g. For the 0-chain, starting at 9, set 
X' = 9 + e and draw 9' from 7r(-|x'). It has stationary 
distribution vr. Observe that 

Eg{X'')=Eg{{9 + ef) 



j=0 



Thus (H2) of Section 3 is satisfied with rjk = I. To 
check the conjugate condition we may use results 
of [87], Section 4. In present notation, Morris shows 
that if pk is the monic orthogonal polynomial of de- 
gree k for the distribution vr and p'j. is the monic 
orthogonal polynomial of degree k for the distribu- 
tion m, then 



ni 



ni + 712 



bkp'k{x). 



Here vr is taken as the sum of rii copies and e the 
sum of n2 copies of one of the six families and 



fc-i 



-p-j- l + ic/ni 

+ic/{ni -I- 712)' 



1=0 



where c is the coefficient of m?{9) in (t'^{9) = a + 
bm{9)+c'rin?{9) for the family. Comparing lead terms 
gives (H3) (in Section 3) with an explicit value of Hk- 
In the present setup, ^k = Pk is the kth eigenvalue. 

We now make specific choices for each of the six 
cases. 

5.1 Binomial 



For fixed p, < p < 1, let vr = Bin(ni,p), g 
in(n2,p). Then m = Bin(ni + n2,p) and 

/n-i \ / r).n \ 



is hypergeometric. The ^-chain progresses as a popu- 
lation process on < < ni: from 9, there are £ new 
births and the resulting population of size x = 9 + £ 
is thinned down by random sampling. The x-chain 
denoted by {Xk}k>o can be represented in an au- 
toregressive cast. More precisely. 



(5.1) 



Xk+i = Sxf. + £k+i, 



where Sx^ is a hypergeometric with parameters ni,n2, 
Xk and £k+i is drawn independently from Bin(n2,p). 



20 



P. DIACONIS, K. KHARE AND L. SALOFF-COSTE 



For the binomial, the parameter c is c = — 1 and 
the eigenvalues of the x-chain are 

ni(ni - 1) • • • (rai - A: + 1) 

(rei + re2)(ni + n2 - 1) • • • (ni + n2 - A; + 1) ' 

< A: < 77,1 + 77,2 = N. 

Note that f3k = for > 77i + 1. The orthogonal 
polynomials are Krawtchouck polynomials ([68], 1.10; 
[63], page 100): 

P. 

which satisfy 

N 



This gives the desired result since we also have Xo (^) ^ 
NqPf. □ 

In general, it is not very easy to evaluate the poly- 
nomials kj at X ^ to estimate Xxi^) s-i^d under- 
stand what the role of the starting point is. However, 
if p = 1/2 and x = ni = N/2, then the recurrence 
equation ([63], (6.2.37)) shows that k2j+i{N/2) = 
and 



(5.2) 



kj{x) = 2-Fi 



-J,-x 
-N 



k2j{N/2) = i-iy 



(2i-l)! 



E(" )p"(l-p)^-"%(x)A;,(x) 

x=0 



NY^ fl-p 



P 



Proposition 5.1. Consider the chain (5.1) 077 
{0, . . . , 77i + 772} with < p < 1, starting at x = 0. 
Set N = ni+n2, q = p/{l — p). Then we have 

e-^<Xo(^)<e-V"^ 

777/1 e 77 e7;er 

log{qN) +c 



I-- 



c G (—00, cxd). 



-21og(l-n2/A^)^ 
Note two cases of interest: (i) For p= 1/2, the 
proposition shows that ^2iog{i^n2/N) steps are nec- 
essary and sufficient. There is a chi-square cutoff 
when tends to infinity, (ii) For p = 1/N, there is 
no cutoff. 

Proof. From (2.9) and (5.2), we have /c|(0) = 1 
for all j and the chi-square distance becomes 



J 



with = 77i + 772, Q = p/i^ — p) and £>1. For j < 
771, the eigenvalues satisfy 

Hence, we obtain 



7^ 4^-1^^ 



N 



N 



j=l \ / 

= (l + <7/3f)^-l 
<giV/3f(l + #f)^-i. 



Thus, for ^> 1, 

x%/2W = E 



(Ar_2j + l)2/ 
(2i-l)! 



2£ 



{N - 2j + 1)2, 
(2j-l)! 



riv/41 

f={ ^'^2j{N-2j + l)2j- 

This is small for £ = 1 if is large enough. Indeed, 
splitting the sum into two parts at N/ 8 easily yields 



X%/2W<^2-^' + 



2N 



5.2 Poisson 



Fix positive reals /i, 771,772. Let tt = Poisson(/L7ni), 
(jr = Poisson (/7 772). Then 



77T, = PoisSOn(/7(77i + 772)) 



and 



Bin X, 



771 



771 + "^2 



The x-chain is related to the M/M/oo queue and the 
0-chain is related to Bayesian missing data examples 
in Section 2.3.3. Here, the parameter c = so that 



771 



771 +^2 



0<k<(X>. 



The orthogonal polynomials are Charlier polynomi- 
als ([68], 1.12; [63], page 177): 



-J:-X 



E- 



Cj{x)Ck{x) =j\fi ^jk- 



We carry out a probabilistic analysis of this problem 
in [25]. 
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5.3 Negative Binomial 

Fix p with < p < 1 and positive real ni , n2 . Let 
TT = NB(ni,p), g = NB(n2,p). Then m = NB(ni + 
77-2, p) and 

'x\ T{ni+n2)T{e + ni)T{x-e-n2) 



""^^'^^ \0) r(x + ni+n2)r(ni)r(n2) 

< 61 < X, 

which is a negative hyper geometric. A simple ex- 
ample has ni = n2 = 1 (geometric distribution) so 
tt{6\x) = 1/(1 + x). The x-chain becomes: From x, 
choose 9 uniformly in < < x and let X' = 6 + e 
with e geometric. The parameter c = 1 so that 

/3o = l, 

ni(ni + 1) • • • (ni + fc - 1) 
(fii + n2)(ni + 712 + 1) • • • (ni + ^2 + - 1) ' 

1 < < oo. 



The orthogonal polynomials are Meixner polynomi- 
als discussed in Section 4.2. 

5.4 Normal 

Fix reals ^ and ni, n2, f > 0. Let vr = Normal(ni/i, 
niu), (7 = Normal(n2/u, n2u). Then m = Normal((ni -|- 
n2)n,ini + n2)v) and 7r(6'|x) = Normal(^^^^x, 
^^u). Here c = and 



ni 



0<k<oo. 



,711 +n2. 

The orthogonal polynomials are Hermite, discussed 
in Section 4.3. Both the x- and ^-chains are classical 
autoregressive processes as described in Section 4.3. 

5.5 Gamma 

Fix positive real ni,n2,a. Let vr = Gamma(ni, a), 
g = Gamma(n2,a). Then 

m = Gamma(ni -\-n2,a), 

7r{6\x) = X ■ Beta(ni, 71-2). 

A simple case to picture is a = ni = ?i2 = 1. Then, 
the x-chain may be described as follows: From x, 
choose 9 uniformly in (0, x) and set X' = 6 + e with 
£ standard exponential. This is simply a continuous 
version of the examples of Section 5.3. The param- 
eter c = 1 and so 



/5o = l, 



ni{ni -I- 1) • ■ • (fii + fc - 1) 



(ni -I- n2){ni + n2 + 1) ■ ■ ■ {ni + n2 + k - 1)' 

< < 00. 



The orthogonal polynomials are Laguerre polynomi- 
als, discussed in Section 4.2 above. 

5.6 Hyperbolic 

The density of the sixth family is given in Sec- 
tion 2.3 in terms of parameters r > and \9\ < tt/2. 
It has mean fi = rtan{9) and variance /U^/r -|- r. See 
[86], Section 5 or [37] for numerous facts and refer- 
ences. Fix real and positive ni,n2- Let the density 
vr be hyperbolic with mean ni/x and ri = ni{l + fx^). 
Let the density g be hyperbolic with mean n2fJ- and 
f2 = iT-2{^ + /U^). Then m is hyperbolic with mean 
(ni + 722)/-* and r = (ni -1- 712) (1 + /i^). The condi- 
tional density Tr{9\x) is "unnamed and apparently 
has not been studied" ([87], page 581). 
For this family, the parameter c = 1 and thus 



1, 



ni(ni -I- 1) • • • (ni -I- A; - 1) 



(ni + 712) • • • (rai + ?i2 + A; - 1) 

The orthogonal polynomials are Meixner-Pollaczek 
polynomials ([100], page 395; [68], 1.7; [63], page 
171). These are given in the form 



(2A)„ 



(5.3) 



nl 



-2F1 



-n, X + ix 
2A 



1 



-2iip 



■P,yY. -P/n dX 



1 /-oo 

-/ e(2^-^)^|r(A + 
2vr J_oo 

r(n + 2A) ^ 

Here —00 < x < 00, A > 0, < ip < tt. The change 
of variables y = ip = ^ + tan~^(0) A = r/2 trans- 
forms the density e(^'^~'^)^|r(A + ix)p to a constant 
multiple of the density fe{x) of Section 2.4. 

We carry out one simple calculation. Let vr,^ have 
the density of -log|C|, with C standard Cauchy. 
Thus 



(5.4) 'n{dx) = g{x) dx 



1 



■ dx. 



2 cosh(vrx/2) 

The marginal density is the density of ^log|CiC2|, 
that is, 

mix) = — r^. 

^ ' 2sinh(7rx/2) 

For the additive walk based 



Proposition 5.2. 
on (5.4).- 

(a) The eigenvalues are Pk 



1 

k+i •■ 



0< k<oo. 
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(b) The eigenf unctions are the Meixner-PoUaczek 
polynomials (5.3) with (/9 = 7r/2, A = l. 

(c) xlii) = 2E^=i(fc + ly^'-HPtih W- 

Proof. Using T{z + 1) = zT{z), T{z)T{l - z) = 
. f N , we check that 

|r(i + ix)p = r(i + ix)r(i - ix) 

= {ix)T{ix)T{l — ix) 

Tr(ix) TTX 

sin7r(ix) sinh(7rx) 

The result now fohows from routine simphfication. 

□ 

Remark. Part (c) has been used to show that 
order log x steps are necessary and sufficient for con- 
vergence in unpublished joint work with Mourad Is- 
mail. 

6. OTHER MODELS, OTHER METHODS 

Even in the limited context of bivariate Gibbs 
sampling, there are other contexts in which the in- 
gredients above arise. These give many further ex- 
amples where present techniques lead to sharp re- 
sults. This section gives brief pointers to Lancaster 
families, alternating projections, the multivariate 
case and to other techniques for proving conver- 
gence. 

6.1 Lancaster Families 

There has been a healthy development of bivariate 
distributions with given margins. The part of inter- 
est here begins with the work of Lancaster, nicely 
summarized in his book [72]. Relevant papers by 
Angelo Koudou [69, 70, 71] summarize recent work. 
Briefly, let {X , /j,) , {y , u) be probability spaces with 
associated L^(/_f), L^(z/). Let o" be a measure on X x 
y with margins /i and i^. Suppose the relevant con- 
ditional probabilities exist, so 

a{dx,dy) = fi{dx)Kx{dy) = i>{dy)Ly{dx). 

Say that u is a Lancaster probability with respect 
to the given sequences of orthonormal functions if 
there exists a positive real sequence {pn} such that, 
for all n, 

(6.5) j pn{x)qm{y)cr{dx,dy) 

This equation is also equivalent to / qn{y)Kx{dy) = 

PnPn 

(x) and to / pn{x)Ky{dx) = Pnqniu)- 



Koudou shows that, given a, one can always find 
sequences of orthonormal functions such that a is 
Lancaster for these sequences. He then character- 
izes those sequences pn such that the associated a 
is absolutely continuous with respect to p x i/ with 

= At ^ = PnPn{x)qn{y) 

d{p X i^) „ 

in L'^{p X I/). 

For such Lancaster densities (6.6), the equivalence 
(6.5) says precisely that the x-chain for the Gibbs 
sampler for / has {pn} as eigenvectors with eigen- 
values {Pn} (cf. Section 3 above). For more on Lan- 
caster families with marginals in the six exponential 
families, see [7], Section 7, which makes fascinat- 
ing connections between these families and diagonal 
multivariate families. 

The above does not require polynomial eigenvec- 
tors and Griffiths [52] gives examples of triangular 
margins and explicit nonpolynomial eigenfunctions. 
Here is another. Let G be a compact Abelian group 
with characters {pn{x)} chosen orthonormal with 
respect to Haar measure p. For a probability density 
/ on G, the location problem 

a{dx, dy) = f{x - y)p{dx)u{dy) 

has uniform margins, {pn} as eigenfunctions of the 
x-chain and Pn = J Pn{x) f (x) dp{x) as eigenvalues. 

A main focus of Koudou and his co-authors is 
delineating all of the extremal Lancaster sequences 
and so, by Choquet's theorem (every point in a com- 
pact convex subset of a metrizable topological vector 
space is a barycenter of a probability measure sup- 
ported on the extreme points; see [84], Chapter 11.2) 
all densities / with {pn}, {q-n} as in (6.5). Of course, 
any of these can be used with the Gibbs sampler and 
present techniques. These characterizations are car- 
ried out for a variety of classical orthogonal polyno- 
mials. In particular, Koudou gives a clear translation 
into probabilists' language of Gasper's complete de- 
termination of the extremals of the Lancaster fami- 
lies with equal Beta(a,/3) margins for a,/3 > |. We 
give but one example. 

Example. Consider the uniform distribution on 
the unit disk in given by 

(6.7) f(x,y) = -, 0<x2+y2<i. 
vr 

The conditional distribution of X given Y is uni- 
form on [— Vl — Y"^, Vl — Y"^]. Similarly, the con- 
ditional distribution of Y given X is uniform on 
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[— Vl — X'^, \/l — X'^]. The marginal density of X 
is 



m 



{x) = -Vl 



-1< X < 1. 



vr 



By symmetry, Y has the same marginal density. 
Since for every A; > 0, 

(l-y2)fc 



E(X2'= I Y) 
E(y2'= I X) 



2k + 1 ' 
[l-X^)'' 
2k + 1 



and 



^^X'^k+i \Y) = 0, 

E(y2fc+i I X) = 0, 

we conclude that the x-chain has polynomial eigen- 
functions {pk{x)}k>o and eigenvalues {Afc}fc>o where 
A2fc = (2fc+i)a and X2k+i = 0, > 0. The polynomi- 
als {pk}k>o are orthogonal polynomials correspond- 
ing to the marginal density m. They are Chebyshev 
polynomials of the second kind, given by the identity 

sin((A: + 1)^) 



Pkicos{6)) 



sm{e) 



0<d<TT,k>0. 



Using present theory we have the following theorem. 

Theorem 6.1. For the x-chain from the Gibbs 
sampler for the density (6.7); 

(a) xl{l) = j:T=ij2khwPlki^)- 

(b) Forx = 0, xUl) = Ek=ij2kTTW 
Thus 



+ 1\ 1 



(c) For \x\ - 
Thus 

1 



<xl{i)< 



81 



2 y ■ 

1 

(2fc+l)4'- 

-)- 

6 J 34'- 



2 ■ 



One of the hallmarks of our examples is that they 
are statistically natural. It is an open problem to 
give natural probabilistic interpretations of a gen- 
eral Lancaster family. There are some natural ex- 
amples. Eagleson [36] has shown that Wi + W2 and 
W2 + W3 are Lancaster if Wi,W2, W3 are indepen- 
dent and from one of the six quadratic families. Nat- 
ural Markov chains with polynomial eigenfunctions 
have been extensively studied in mathematical ge- 
netics literature. This work, which perhaps begins 



with [42], was unified in [14]. See [38] for a text- 
book treatment. Models of Fisher- Wright, Moran, 
Kimura, Karlin and McGregor are included. While 
many models are either absorbing, nonreversible, or 
have intractable stationary distributions, there are 
also tractable new models to be found. See the Stan- 
ford thesis work of Hua Zhou. 

Interesting classes of reversible Markov chains with 
explicit polynomial eigenfunctions appear in the work 
of Hoare, Rahman and their collaborators [20, 58, 
59, 60]. These seem distinct from the present exam- 
ples, are grounded in physics problems (transfer of 
vibrational energy between polyatomic molecules) 
and involve some quite exotic eigenfunctions (9 — j 
symbols). Their results are explict and it seems like a 
worthwhile project to convert them into sharp rates 
of convergence. 

A rather different class of examples can be cre- 
ated using autoregressive processes. For definiteness, 
work on the real line M. Consider processes of form 
Xq = 0, and for 1 < n < 00, 

Xn+l = CLn+lXn + £n+l, 

with {{ai,ei)}i>i independent and identically dis- 
tributed. Under mild conditions on the distribution 
of (aj,ej), the Markov chain X„ has a unique sta- 
tionary distribution m which can be represented as 
the probability distribution of 



X^ 



■ eo + oqEi + aiaoe2 + 



The point here is that for any k such that moments 
exist 

E{X'^\Xo = x)= E{{aix + eif) 



k 

E 

i=0 



x'E{a\e\~'). 



If, for example, the stationary distribution m has 
moments of all orders and is determined by those 
moments, then the Markov chain {X„}^q is gener- 
ated by a compact operator with eigenvalues E{a\)^ 
< i < 00, and polynomial eigenfunctions. 

We have treated the Gaussian case in Section 4.5. 
At the other extreme, take |a| < 1 constant and let £{ 
take values ±1 with probability 1/2. The fine prop- 
erties of TT have been intensively studied as Bernoulli 
convolutions. See [23] and the references there. For 
example, if a = 1/2, then vr is the usual uniform dis- 
tribution on [— 1 , 1] and the polynomials are Cheby- 
shev polynomials. Unfortunately, for any value of 
a 7^ 0, in the ±1 case, the distribution vr is known 
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to be continuous while the distribution of Xn is dis- 
crete and so does not converge to vr in or L^. We 
do not know how to use the eigenvalues to get quan- 
titative rates of convergence in one of the standard 
metrics for weak convergence. 

As a second example take (a, e) = (u, 0) with prob- 
ability p and (1 + u, —u) with probability 1 — p with 
u uniform on (0, 1) and p fixed in (0, 1). This Markov 
chain has a Beta(p, 1 — p) stationary density. The 
eigenvalues are l/{k + 1), 1 <k < oo. It has poly- 
nomial eigenfunctions. Alas, it is not reversible and 
again we do not know how to use the spectral infor- 
mation to get usual rates of convergence. See [23] or 
[75] for more information about this so-called "don- 
key chain." 

Finally, we mention the work of Hassairi and Zarai 
[57] which develops the orthogonal polynomials for 
cubic (and other) exponential families such as the 
inverse Gaussian. They introduce a novel notion of 
2-orthogonality. It seems possible (and interesting) 
to use their tools to handle the Gibbs sampler for 
conjugate priors for cubic families. 

6.2 Alternating Conditional Expectations 

The alternating conditional expectations that un- 
derlie the Gibbs sampler arise in other parts of prob- 
ability and statistics. These include classical canoni- 
cal correlations, especially those abstracted by Daux- 
ois and Pousse [21]. This last paper contains a frame- 
work for studying our problems where the associated 
operators are not compact. 

A nonlinear version of canonical correlations was 
developed by Breiman and Jerry Friedman as the 
A.C.E. algorithm. Buja [13] pointed out the connec- 
tions to Lancaster families. He found several other 
parallel developments, particularly in electrical en- 
gineering ([13], Section 7-12). Many of these come 
with novel, explicit examples which are grist for our 
mill. Conversely, our development gives new exam- 
ples for understanding the alternating conditional 
expectations that are the central focus of [13]. 

There is a classical subject built around alternat- 
ing projections and the work of von Neumann. See 
Deutsch [22]. Let Tii and H2 be closed subspaces 
of a Hilbert space 7i. Let Pi, P2 be the orthogonal 
projection onto Wi, 0.2 and let Pi be the orthogo- 
nal projection onto 7ii H 7^2- von Neumann showed 
that (i-i-P2)" Pi as n tends to infinity. That is 
\\{PiP2T{x) - Pi{x)\\ 0. In [24] we show that the 
Gibbs sampler is a special case of von Neumann's 
algorithm; with H = L^{a),ni = L'^in),'H2 = L'^{v) 



using the notation of Section 6.1. We develop the 
tools used to quantify rates of convergence for von 
Neumann's algorithm for use with the Gibbs sam- 
pler in [24]. 

6.3 Multivariate Models 

The present paper and its companion paper [25] 
have discussed univariate models. There are a num- 
ber of models with x or 6 multivariate where the as- 
sociated Markov chains have polynomial eigenfunc- 
tions. Some analogs of the six exponential families 
are developed in [15]. In Koudou and Pommeret [69], 
the Lancaster theory for these families is elegantly 
developed. Their work can be used to give the eigen- 
values and eigenfunctions for the multivariate ver- 
sions of the location models in Section 5. In their 
work, families of multivariate polynomials depend- 
ing on a parameter matrix are given. For our ex- 
amples, specific forms must be chosen. In a series 
of papers, developed independently of [69], Griffiths 
[51, 53] has developed such specific bases and illu- 
minated their properties. This work is turned into 
sharp rates of convergence for two-component mul- 
tivariate Gibbs samplers in the Stanford thesis work 
of Kshitij Khare and Hua Zhou. 

An important special case, high-dimensional 
Gaussian distributions, has been studied in [2, 50]. 
Here is a brief synopsis of these works. Let m(x) be 
a p-dimensional normal density with mean and 
covariance S [i.e., A'p(/i,S)]. A Markov chain with 
stationary density m may be written as 

(6.8) Xn+i = AXn + Bv + Cen+i. 

Here e„ has a Np{0,I) distribution, v = and 
the matrices A,B,C have the form 

A = -{D + L)-^L^, 

B = {D + Ly\ 

C = {D + L)-^D^/'^, 

where D and L are the diagonal and lower triangular 
parts of The chain (6.8) is reversible if and only 
if ATi = TiA^ . If this holds, A has real eigenvalues 
(Ai, A2, . . . , Ap). In [50], Goodman and Sokal show 
that the Markov chain (6.8) has eigenvalues A^ and 
eigenfunctions Hk for K = {ki,k2, ■ ■ ■ , kp), ki > 0, 
with 

A^ = ^A^, HKix) = f[HkSz^), 

i=l i=l 
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where Z = P'^T.-^/'^X and {Hk] are the usual 
one-dimensional Hermite polynomials. Here 
= PDP^ is the eigendecomposition of 
Goodman and Sokal show how a vari- 
ety of stochastic algorithms, including the system- 
atic scan Gibbs sampler for sampling from m, are 
covered by this framework. Explicit rates of conver- 
gence for this Markov chain can be found in the 
Stanford thesis work of Kshitij Khare. 

6.4 Conclusion 

The present paper studies rates of convergence us- 
ing spectral theory. In a companion paper [25] we 
develop a stochastic approach which uses one eigen- 
function combined with coupling. This is possible 
when the Markov chains are stochastically mono- 
tone. We show this is the case for all exponential 
families, with any choice of prior, and for location 
families where the density g{x) is totally positive of 
order 2. This lets us give rates of convergence for 
the examples of Section 4 when moments do not ex- 
ist (negative binomial, gamma, hyperbolic). In ad- 
dition, location problems fall into the setting of it- 
erated random functions so that backward iteration 
and coupling are available. See [17, 23] for extensive 
references. 

APPENDIX: PROOF OF THEOREM 3.1 

The proof will follow from the two dual lemmas 
below which show that the expectation operators Eq 
and Ex each take one orthogonal polynomial fam- 
ily into the other. For the Beta/Binomial example 
treated in the Introduction and in Section 4.1, these 
operators relate Hahn polynomials on {0, . . . ,n} to 
Jacobi polynomials on (0, 1). These facts are of in- 
dependent interest and some have been observed be- 
fore. See [62], (3.7) for the correspondence between 
Hahn and Jacobi polynomials and for a host of fur- 
ther references. 

Lemma A1. Ee[pk{X)] = mqkiO), 0<k<c. 

Proof. For A; = 0, Eg[po] = 1 = rjoqo. If < k < 
c, then for <i < k, the unconditional expectation 
is given by 

E[e%{X)]=E[pk{X)Ex{e')]=E[pk{X)p{X)] 

with p a polynomial of degree i < k. Since < z < 
k<c, E[pk{X)p{X)] = by orthogonahty. Thus = 
EfpkiX)]=E[e'Ee{pk{X))]. By assumption (H2), 
^ E0\pk{X)] is a monic polynomial of degree k in 



9. Since it is orthogonal to all polynomials of degree 
less than k, we must have Eg[pi:{X)] =r]i.qi.{9). □ 

The second lemma is dual to the first. 

Lemma A2. Ex[qk{0)] = fikPk{x), < k < c. If 
c < oo, Ex{qk{0)) = for k>c. 

Proof. The first part is proved as per Lemma 
Al. If c < oo, and k > c, by the same argument 
we have, for < j < c, E[pj{X)Ex[qk{0)]] = 0. But 
{Pj}o<j<c form abasis for L'^{m{dx)), and Ex[qk{0)] S 
L?'{m{dx)) since 

E[{Exqk{0)f]<E[ql{e)]<^. 

It follows that Ex[qk{0)] =0. □ 

Proof of part (a) of Theorem 3.1. Suppose 
< A; < c. From the definitions, the x-chain operates 
on pk as 

Ex[Ee{Pk{X'))] = Ex[r]kqk{0)] = WkPkix) 

with equalities from Lemmas Al, A2. Hence, r/fc/i^ 
are eigenvalues of the x-chain with pk as eigenfunc- 
tions. This proves (a). □ 

Proof of part (b). Suppose first < A: < c. 
Then, arguing as above, ^kVk are eigenvalues of the 
0-chain with qk as eigenvectors. If c = oo, we are 
done. If c < oo, then, for k > c, Lemma A2 shows 
that qk is an eigenfunction for the ^-chain with eigen- 
value zero. □ 

Proof of part (c). From the development in 
Section 2.1, the random scan chain K takes L^{P) 
into L2(m) + L'^in) C L'^{P) and ker K D {L'^{m) + 
L2(^))^. We have 

Kg{X,e) = ^Ex[g{x,d')] + '^Ee[g{X' ,9)]. 

For < A; < c, consider K acting on Pk{x) + 'i'fc(^)- 
The result is 

Up,{x)+Ex[qk{e%[^) 
+ UEe[pk{x)] + M qk{e)) 

= Q + IvVkf^k^ (pk{x) + qk{0)^ ■ 

Similarly, 

K(pk- qk){x,0) 

= {l-lVw){pk{x)-^^qk{9)). 
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Suppose first that c < cxd. For k > c, Lemma A2 
shows ExQkiO) = for all x. Thus Kqk{x,9) = ^qk{d). 
Further 

spanipfc(x) ± ./— qk{9) 

0<k<c, qk{0) c<k< cx)| 

= span{pfc(x) < A; < c, qk{0), < k < oo} 

= L2(m) +L2(7r). 

It follows that K is diagonalizable with eigenval- 
ues / eigenvectors 

\±\y/Jl0%, pk{x) ± ^J^qk{0) foTO<k<c, 

^, qk{(^) forc<A;<oo, 

and Kg = for g G {L'^{m) + L'^{-k))^. 

Suppose next that c = cxd; then K is diagonalizable 
with eigenvalues / eigenfunctions 

(]- ± y/r]klJik] , pk{x) ± qk{0), 0<k<oo. 

/ V /^fc 

Again, span {pk{x) ± qk(.0) < k < c} = 

span {pk{x),qkiO)} = L^{m) + L'^{tt) and Kg = 
for g G {L?{m) + L^(7r))-'-. This completes the proof 
of (c). □ 
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